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Abstract 


We discuss several strategies to implement Dykstra’s projection 
algorithm on NVIDIA’s compute unified device architecture (CUDA). 
Dykstra’s algorithm is the central step in and the computation¬ 
ally most expensive part of statistical multi-resolution methods. It 
projects a given vector onto the intersection of convex sets. Com¬ 
pared with a CPU implementation our CUDA implementation is 
one order of magnitude faster. For a further speed up and to reduce 
memory consumption we have developed a new variant, which we 
call incomplete Dykstra’s algorithm. Implemented in CUDA it is one 
order of magnitude faster than the CUDA implementation of the 
standard Dykstra algorithm. 

As sample application we discuss using the incomplete Dykstra’s 
algorithm as preprocessor for the recently developed super-resolution 


optical fluctuation imaging (SOFI) method (Dertinger et al. 2009). 
We show that statistical multi-resolution estimation can enhance 
the resolution improvement of the plain SOFI algorithm just as the 
Fourier-reweighting of SOFI. The results are compared in terms of 
their power spectrum and their Fourier ring correlation (Saxton and 
Baumeister 1982). The Fourier ring correlation indicates that the 


resolution for typical second order SOFI images can be improved by 
about 30%. 


stephan.kramer@mpibpc.mpg.de 





Our results show that a careful parallelization of Dykstra’s al¬ 
gorithm enables its use in large-scale statistical multi-resolution 
analyses. 
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1 Introduction 


An important topic in image analysis is the rejection of noise and blur from 
digital images. A particular example is denoising and deblurring of micrographs 
in optical microscopy. The unknown signal from object space is convolved with 
the point spread function (PSF) of the imaging apparatus and disturbed by 
Poissonian noise. Digital image recording leads to a projection of the convolved 
signal onto a discrete set of points in space (pixels) and intensities, such that 
the micrograph is a vector of real data of finite length. The blur operator 
representing the PSF is ill-posed (Vogel 2002) which makes its inversion in 
presence of noise a numerically expensive and difficult task. Regularization 
techniques have to be applied, see for example (Facciolo et al. 2009). Although 
the usage of a regularization term stabilizes iterative reconstruction algorithms, 
a regularization parameter has to be introduced, the choice of which is crucial for 
the estimator quality. An inadequate choice of the regularization parameter leads 
to either a loss of details in the image, or to artifacts from the ill-conditioned 
nature of the inversion problem. There are methods to choose the regularization 
in a spatially adaptive, iterative manner (Chen et al. 2006 Grasmair 2009 Dong, 
Hintermuller, and Rincon-Camacho 2011). Spatially adaptive regularization 
methods detect areas of under- and over-regularization in an image and then 
locally adapt the regularization parameter. Despite the success of these methods 
(Rodriguez 2013), the influence of the regularization parameter on the expected 
closeness of estimator to the true unknown signal is often not obvious. 

A slightly different approach to the problem is given by statistical multi¬ 
resolution estimators (SMRE) (Frick, Marnitz, and Munk 2012; Frick, Marnitz, 
and Munk 2013). Common SMRE methods attempt to control the statistical 
behavior of residuals on several scales simultaneously, thus allowing the recon¬ 
struction of image details on different length scales at the same time. Figure [l] 
illustrates the idea. The result is equivalent to a maximum-likelihood estimator 



Figure 1: Left: The main concept of statistical multi-resolution estimation. 
The residual statistics are considered in, possibly overlapping, subsets. The 
distribution of residuals in each subset is required to be a realization of a given 
statistical process, i.e. Poisson or Gauss, with chosen confidence a. Right: 
Multi-resolution adapted to images. The subsets are parts of the image. The 
sum of squared residuals are required, in the example of normal distributed 
residuals, to follow a y 2 distribution. 


with a properly chosen regularization parameter (Chambolle and Lions 1997). 
The appealing feature of an SMRE is that the only free parameter in the 
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resulting algorithm is the desired confidence level a in the hypothesis tests. 
Hence, the choice for the value of a has a sound statistical interpretation. 

However, this comes at the price of an increased computational effort. The 
SMRE is to be computed iteratively. In each iteration the residuals are to be 
controlled in each subset of the image, see for example one of the algorithms 
given in (Frick, Marnitz, and Munk 2013). 

A common method to minimize a convex functional with respect to a set 
of independent variables in presence of constraints is the alternating direction 
method of multipliers (ADMM) (Gabay and Mercier 1976; Hong and Luo 2012). 
One step in the ADMM is the projection of a given realization of residuals onto 
the allowed subset of residuals in M n . Because of the nature of the constraints 
this is the projection onto the intersection of many convex sets in Euclidean 
space. Each set is given by one length scale. A solution to this task is given 
by Dykstra’s projection algorithm (Dykstra 1983; Gaffke and Mathar 1989), 
which projects a given noise estimate onto the feasible set of the multi-resolution 
constraint. A serial implementation of this algorithm is straightforward but 
becomes infeasible as the number of considered scales increases. 

In recent years graphics cards have evolved from simple, dedicated graph¬ 
ics processing units (GPU) to autonomous parallel compute devices within a 
computer, usually referred to as general-purpose GPUs (GPGPUs). With the 
introduction of CUDA (compute unified device architecture) and OpenCL there 
are powerful, yet sufficiently simple APIs (application programming interfaces) 
available for parallel computing. 

This work presents two different parallel implementations of Dykstra’s al¬ 
gorithm optimized for NVidia’s CUDA (Buck 2007). The first one implements 
Dykstra’s algorithm in a general form. The second implementation emerged 
from the work on the first one and introduces a variant which we call incomplete 
Dykstra’s algorithm (ICD). Its key feature is that only those subsets are chosen 
which fulfill the constraints of memory accesses on CUDA devices, i.e. the 
subsets for a given scale have the shape of a tile, do not overlap, cover the 
whole image, the edge length is a power of two and the whole tile fits into the 
shared memory of one multiprocessor of a CUDA device. Despite its approxi¬ 
mate nature, according to our tests presented in this paper, the ICD preserves 
the quality and statistical interpretation of the result obtained from the exact 
algorithm, but significantly improves the execution time. The speedups of the 
CUDA implementations are measured on simulated test data. The power of the 
new ICD is demonstrated by using it as preprocessor for the super-resolution 
optical fluctuation imaging (SOFI) method (Dertinger et al. 2009), a recent 
development in fluorescence microscopy. As the number of frames for SOFI 
applications has to be large enough to resolve the statistical properties of the 
fluorescence signal, for instance in case of quantum dots up to several thousand 
images are required, the performance of an SMRE is crucial for the feasibility of 
applying it as preprocessor. 

This paper is organized as follows. Section [2] contains a basic description of 
the theory of multi-resolution and introduces the ADMM, Dykstra’s algorithm, 
SOFI and the Fourier ring correlation (FRC) for the quantitative assessment 
of the resolution improvement. Section [3] highlights the key features of our 
implementation. For a complete listing see the accompanying source code. In 
Sec. [4] we discuss our results, which include tests of the performance and the 
resolution improvement of the two variants of Dykstra’s algorithm. To study 
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the performance on real data obtained from an experiment we employ SMRE as 
pre- and postprocessor for SOFI. Finally, Sec. [5] gives a conclusion. 


2 Statistical Multi-resolution Estimation 

In statistical signal processing the noisy measurement I G M m of the signal 
x G M n is formulated as 


I = A * x* + e , (1) 

where A is the PSF of the imaging apparatus, x* is the true signal underlying the 
measurement / and e is a noise vector. The convolution of A and x* is denoted 
as A*x*. The estimates for the true signal and the noise, based upon knowledge 
of /, are written as x and e, respectively. In practice, both are unknown. The 
linear operator A is usually ill-posed which renders the inversion of Eq. 0 
infeasible in any real world situation. In order to overcome this difficulty the 
problem is augmented with a regularization term R(x). A common example for 
the regularization is total variation (TV) 


-Rtv (x) = II^IItv := F V |(( Vx )i)* 


i =1 k =1 

where d is 2 for planar images, and 3 for image stacks. The latter frequently 
occur in confocal microscopy. The outer sum over i is over all pixels and the 
inner sum computes the L\ norm of the local intensity gradient. TV estimators 
perform well on natural images (Facciolo et al. 2009]), which are expected to 
consist of smooth areas and sharp boundaries. The term R may be substituted 
with a more problem-specific term, see for example (Diekmann et al. 2001). 

A multi-resolution estimator recovers the most regular vector 

x = argmuq, e G (e) + R (x) s.t. / = A * x + e (2) 

with respect to ii(x), which fulfills the multi-resolution constraint 


G(e) 


0 if max 8g n c a (a) e? < 1 


OC 


else 


(3) 


The function G is a reformulation of the constraint that the noise e has to 
be normally distributed, e ~ 7V(0,cr 2 ). Using an Anscombe transformation 
(Frick, Marnitz, and Munk j2013) the Poissonian noise of digital cameras can 
be transformed such that it approximately obeys a normal distribution. The 
image / G M m contains m pixels, which are indexed by the index set {1, ...,m}. 
A multi-resolution analysis works on subsets s C {1, ...,m} of the pixel indexes. 
The set of all subsets of pixel indexes is denoted as 

ft = {« I sc {1 ,to}} . (4) 

The feasible set generated by a subset s of pixel indexes is denoted as 

t s (x) := {x £ K"|c s (a) < 1} . (5) 

i£s 
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( 6 ) 


The intersection of all r s defines the feasible set 


tg 


it 

sen 


of the multi-resolution constraint G. 

In practical applications Vt will be restricted to connected areas of only a 
few pixels since the computational effort increases with |0|, the cardinality of 
O. Then G computes the maximum squared residual on all subsets s G O. 
With an appropriate choice of the weights c s (a) one can ensure that for each 
e G {e £ M m | G (e) < 00 } the noise e\ s in the pixels belonging to a subset sGH 
is Gaussian distributed with a chosen probability of at least a. Hence, the 
estimator x is the smoothest image in the sense of R which satisfies the multi¬ 
resolution constraint enforced by G, as given in Eq. |3|. This relation is visualized 
in Fig. [2j 



true solution x 
inside w. prob. a 



Figure 2: Left: The feasible sets are generated by the subsets s E O. Their 
dimensions are given by the cardinalities of the subsets s. The estimate x is 
inside of the intersection of all spheres r s . Right: The unknown true solution 
x * is inside of the feasible set with probability a. 


2.1 Selection of the weights 

The weights c s (a) are required to ensure a balanced hypothesis test among all cho¬ 
sen subsets s G H of the image plane. The probability that max sG o c s («) Eies 4 < 
1 is required to equal a for all sets. A method to choose c s (a) accordingly is 
given in (Frick, Marnitz, and Munk 2013]). The function 


ts (e) = F 


ies 


measures the sum of squared residuals in a set s. Its fourth root transform 


(£ s (e)) 3 ~iV pi s = (M -0.5)' 


cr* = 


1 


8vW 


is approximately normally distributed. With this transformation each s con¬ 
tributes equally to the extreme value statistics 


Qn = max 


(t s ) 4 - n s 
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Then, an appropriate choice for the weights is 


c s (a) = 


1 

(q a Vs + Ms ) 4 


In this equation q a is the a quantile of Qq. It follows that 


P 


I ma xc s (a)t s < 1 

y 


= O', 


i.e. the probability that Gaussian noise violates a single constraint is a. The 
probability to violate the constraint is balanced for all subsets by the choice of 
the c s (a). Hence, the unknown true signal x* itself fulfills the constraint with 
probability cq 


P(G(I-A*x*)<l) = a. 


(7) 


Moreover, one can show (Frick, Marnitz, and Munk 2013) a sound statistical 
relation between the estimated signal x and x* with respect to the regularization 
R for the estimator in Eq. © applied to Eq. 0, 


P(R{x) <R(x*)) >a. 


( 8 ) 


The probability that a feature in the recovered signal x is in fact a feature of 
the true image x* and not an artifact arising from incomplete noise removal is 
at least a. 


2.2 Alternating Direction Method of Multipliers 

The numerical minimization of Eq. § can be tackled by using the ADMM 
(Gabay and Mercier 1976 Hong and Luo [2012). The ADMM is a method to 
minimize a convex functional depending on a set of variables in the presence of 
constraints. The constraints are enforced by augmenting the functional with 
Lagrangian terms T, one for each constraint. The convergence speed is enhanced 
by adding a quadratic penalty term || I — (A * x + e) 


11, weighted by a real 


number p > 0 (Hong and Luo 2012). The constrained minimization problem in 
Eq. § then corresponds to the unconstrained minimization with respect to the 
primal variables x G M n and e G M m , and maximization with respect to the dual 
variable T G M m of the objective functional 


C(x,e) = G(e) + R(x) + ^\\I - (A*x + e)\\l 
+ (T,I - (A* x + e)). 


(9) 


The solution x = argmin^ e maxy C (x, e) is obtained iteratively, cf. Algorithm 1 
The estimate after r iterations of the ADMM is denoted as x r . In the limit 
r oc it converges towards the minimum, x r —>• x. The minimization with 
respect to x can be evaluated in linear approximation, without damaging the 


overall convergence (Hong and Luo 2012). The update rule for T as given in 
[Algorithm 1| corresponds to the method of steepest ascent. To simplify the 


minimization we decouple the deconvolution and smoothing by introducing 
another variable z. Hence the problem 


x = argmhq, e G (e) + R (z) s.t. / = A*x + e, z — x , 


( 10 ) 
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Algorithm 1 Alternating Direction Method of Multipliers 

choose S > 0, xq G M n , Y 0 , eo G M m , p > 0 
while \\x r — x r —i ||2 > S and \\I — (A * x r + e r ) ||2 > S do 
x r+ i = arguing £ (x, e r ) 
e r+1 = argmin e C (x r+ 1 ,e) 

T r _|_i = T r + ex (/ — (A * x + e)) 
end while 


has two constraints. This leads to the augmenting variables Yi, Y 2 , pi, and p2. 
The result x is not altered by this additional constraint, but the individual steps 
in the ADMM become easier to solve. The introduction of z does not perturb 
the convergence of the ADMM in linear approximation of C (x, e r ). To further 
reduce the error in each iteration step a stabilization term ^ 11 rrr 11 § is introduced, 
which leads to a modified update rule 

x r+ i = argmin^ ^-\\I - A * x - e||| + (Yi,/ — A*x — e) + 

+ ^\\ X ~ Z \\l + (T 2,X — z) 

which can be used in its linearized version 

x r+1 « argmin^ 7 ||x || 2 + (T 2 - A T * (p 1 (/ - e) + Ti) - p 2 z,x) . ( 11 ) 

Here, A T denotes the transpose of the matrix A. The minimization with respect 
to x has no closed form since A is ill-posed. As long as 7 is chosen sufficiently 
large, x r still converges to x for large r. The minimization with respect to e can 
be rewritten as a projection on the intersection of convex sets, 

e r+ i = Pg^I — A*x+—^, ( 12 ) 

where, Pg(-) is defined as 

P G (X) := argmin e {G(e) + ||| e —X||^} . (13) 


2.3 Dykstra’s Algorithm 


The problem of projecting on the feasible set tq of the multi-resolution constraint 
G is solved using Dykstra’s algorithm (Dykstra |1983 Gaffke and Mathar 1989), 
which requires the knowledge of the projections p s on each of the sets r s (e). In 
general, the projection on a set r s (x) is defined as 


Ps (x ie3 ) 



if c s (a) Y,ies x i > 1 
else 


(14) 


where ||x ||2 s := ^Z ies x\ is the L 2 norm with respect to the pixels belonging 
to the set sGO. This projection is to be performed in each iteration of the 
ADMM. Therefore, its performance is critical for the overall minimization. The 
projection on tq is computed iteratively as shown in [Algorithm 2| The algorithm 
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Algorithm 2 Dykstra’s Algorithm 

choose S > 0, given xo, qo G M n with go,s = 0 Vs G 
while ||x r — x r -i ||§ > S do 

^r+l = Ps (^r ^r,s) 

Qr-\-l,s ~ %r -\-1 

end while 


is known to converge towards the projection of x$ on the intersection of the 
s G The iteration is stopped when the change in x r is below some chosen 
threshold S. In the context of image processing the sets s denote small subsets 
in the image. The projection on two sets si,S 2 G H can be calculated in parallel 
if si n S 2 = 0. However, the order in which the projections are performed is 
crucial for the convergence of the algorithm. Therefore, only non-overlapping 
subsequent projections can be calculated in parallel. 


2.4 SOFI 


Recently, several methods have been developed to overcome the diffraction 
limit (first derived by Abbe) on resolution in optical microscopy, such as PALM 
(photo-activation localization microscopy) (Hess, Girirajan, and Mason 2006), 
STORM (stochastic optical reconstruction microscopy (STORM) (Rust, Bates, 
and Zhuang 2006| ) SIM (structured illumination microscopy) and its nonlinear 
variant (Gustafsson 2005]) and most notably the STED (stimulated emission 
depletion) (Hell and Wichmann 1994]) microscopy which was awarded the Nobel 
prize in Chemistry, 2014. All of these methods require to some extent spe¬ 
cial sample preparation techniques making these methods more elaborate and 
complicated than traditional fluorescence wide-held microscopy. 

A different approach is SOFI (super-resolution optical fluctuation imaging) 
(Dertinger et al. 2009 Dertinger et al. 2010) which infers the additional infor¬ 
mation necessary for resolution improvement from the temporal behavior of the 
signal of the imaged object which in case of fluorescence microscopy works as 
follows. Given N point-like emitters, the fluorescence signal at an arbitrary point 
r G M 3 (in practice this is a pixel) in the image plane can be written as 


= 'Y^U(y -v k )e k s k {t), 


(15) 


where U : is the PSF, which is entirely determined by the optical 

system and time-independent. In the following we restrict the discussion to 
the focus and image plane, i.e. d = 2. The molecular brightness of the kth 
huorophore is denoted as E\~ G M + and the stochastic time-dependence of its 
emitted fluorescence is Sk(t) : M [0,1]. With (.. .) t as average over time, the 
fluctuations SF(r,t) = F(r,t) — (F(r,t)) t of the observed fluorescence are given 
by 

5F(r,t) = yy(r — r k )e k 5s k (t). (16) 

k 

To motivate the idea underlying the SOFI signal we consider the two-point 
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correlation function 62 ( 1 *, r) = ($F(r, t + r) • 5F(r,t))t which by Eq. (15) is 


C 2 (r,r) = y^C/ 2 (r-r fe )£fe((5s fe (t + r) ■ 6s k (t)) t • (17) 

k 

To get a value equivalent to the intensity the final SOFI signal is given by the 
integrated correlation 


Isofi(t) 


£t/ 2 (r- 


*k)e\ 


+ OO 

J ( Ss k {t + T ) 


Ssk(t))t d,T. 


(18) 


In general, instead of correlation functions cumulants C n are used, where n 
denotes the order. The advantage of cumulants is the absence of correlations of 
orders less than n. Then, the intensity Isofi(? ) assigned to a pixel in the final 
SOFI image is proportional to the nth power of the PSF. Usually, a Gaussian 
profile is a good approximation to the shape of the PSF and thus resolution is 
improved by a factor of \/n if a cumulant of order n is used. However, at the 
same time inhomogeneities in the molecular brightnesses are amplified as well, 


as already the simple case of order 2 in Eq. (18) shows. 


A common simplification, especially for higher order SOFI methods, is to 
approximate the integral by the maximal value of the integrand, which for order 
2 is simply the variance of the intensity in a pixel 


C 2 (r,0) = (F 2 (r,t))t-(F(r,t)) 2 t . 


(19) 


For higher orders the cumulants can be computed from the moments of the 
probability distribution of the fluctuations. 

The strength of SOFI is that it can be combined with a variety of microscopy 
techniques without any need for modifying the experimental setup as it is a 
software-only, pure post-processing method. Its disadvantage is that it needs a 
sufficient amount of images to resolve the statistical behavior of the temporal 
fluctuations of the observed (usually fluorescence) signal. However, STORM has 
the same drawback. 


2.5 Fourier ring correlation 


In order to assess the quality of an estimator on experimental data an estimate 
for image resolution is necessary. The Fourier ring correlation (FRC) quantifies 
the image resolution from correlations in Fourier space and the impact of noise on 
those correlations. It was specifically designed for the case where the true signal is 
not known. Originally, the FRC was conceived as resolution measure for electron 
microscopy images (Saxton and Baumeister 1982). Recently, it was proposed as 
resolution criterion for optical super-resolution microscopy (Nieuwenhuizen et al. 
Banterle et al. 2013]). 


2013 


To compute an FRC the correlation along circles in the Fourier plane is 
calculated for different realizations of the noise. The FRC defines image resolution 
as the smallest distance at which the correlation of two considered realisations 
of the same statistical process drops below the predicted correlation for pure 
noise. Given two realizations /i ,/2 €= M n of the same noisy image acquisition 
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process, the correlation of the Fourier transformed images A, I 2 , constrained to 
a ring in distance r to the origin, is calculated as 


FRC(r) 


Er ie r J l( r i) •4*( r i) 



( 20 ) 
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Figure 3: The block diagram illustrates the mapping of parallel processing of 
data (left) to the programming structure of CUDA (middle). This structure is 
mainly modeled following the hardware design of GPUs (right), especially the 
memory hierarchy. 


3 Implementation 


The core of an implementation of an SMRE is the ADMM algorithm for mini¬ 
mizing Eq. (10). The operator A is approximated as Gaussian kernel with the 


standard deviation ctpsf given as input parameter, cf. the parameter listing in 
the appendix. The implementation is based on the SciPAL library. 

The numerically expensive part is the implementation of Dykstra’s algorithm, 
cf. [Algorithm 2| We first discuss the implementation which strictly follows the 
mathematical description as given in Sec. |2.3| which we call the exact method 
(Sec. |3.1| ). Then we discuss our incomplete method (Sec. |3.2| ), which, according to 
our numerical experiments, is equivalent to the exact method, but much simpler 
to implement because it partitions the image in non-overlapping sets right from 
the beginning. For both variants the runtime mainly depends on the maximum 
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size of the subsets and the size of the image. In general, subsets and images may 
have virtually any shape. Therefore, we introduce an effective edge length L s 
for the subsets and Lj for images. Both are defined as the square root of the 
number of pixels in a subset and an image, respectively. We further define the 
maximum multi-resolution depth 

^smre •= log 2 (ma xL s j (21) 

and the associated maximum resolution scale 

isMRE := 2 Ksmre . (22) 


For readers not familiar with the CUDA programming model we provide a brief 


description in Appendix [ 
the GPU is given in Fig. I 


A graphical summary of our way mapping data to 


3.1 Dykstra’s Algorithm (exact Version) 

The parallelized implementation of Dykstras’s algorithm is the key to reach high 
speedups. Figure [4] shows simplified block diagrams of the serial implementa¬ 
tion |(a)| of the ICD |(b)| and of the exact method |(c)| Only non-overlapping 
subsequent projections p s can be calculated in parallel and the sequence of over¬ 
lapping subsets needs to be preserved. This constraint is the principal guiding 
line for the implementation. 

Each group is limited to a maximum size of 1024 pixels per group if each 
thread works on one pixel as this is the maximum number of threads per thread 
block. If the next set to be added to a group during creation exceeds this limit a 
new group is formed, leaving some threads idle in the former group. Therefore, 
in general not all threads of a thread block are in use. To avoid blocking behavior 
within a group, the subsets of pixels must be chosen such that only projections 
on mutually non-overlapping sets occur. The groups of subsets are stored in a 
linked list, in the following called execution queue, which defines the sequence in 
which they are processed. The order in which projections are carried out has to 
be such that the distance between overlapping subsets is large. In this context, 
distance is meant with respect to the positions in the execution queue. 

The efficient computation on GPUs faces two further problems, (i) The 
limited amount of VRAM is not sufficient to store all necessary data on the 
GPU. The memory needed to store all q s variables crucially depends on the 
multi-resolution scale Tsmre and the number of pixels in an image Lf, and is 
^(•^smre L 2 i). For instance, for a 1024 x 1024 pixel image and Tsmre = 15 this 
amounts to 14 GB for single-precision, (ii) To populate all the SMs in order to 
achieve good efficiency when calculating the parallel projections. The classical 
approach of starting a single CUDA kernel is not efficient due to the blocking 
behavior of overlapping subsequent projections and the insufficient amount of 
memory on the GPU. We have to use the conventional RAM on the CPU side as 
additional buffer for temporary data which requires frequent memory transfers 
before and after the execution. Instead, we make use of CUDA’s Hyper-Q feature 
to implement a multi-stream based parallelization. 

A key element of our implementation is a pool of worker threads on the 
host side, where each one uses a CUDA stream to process items received from a 
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(a) serial 




(c) exact 


(b) incomplete 


Figure 4: Differences between a serial implementation (a) and the presented 
parallel exact |(c)| and incomplete [(b)! implementations. Only non-overlapping 
subsequent projections can be calculated in parallel and the sequence of overlap¬ 
ping subsets needs to be preserved. This constraint shaped the implementation 
of the exact method, as it has to check for overlapping (blocking) subsets. The 
performance of the exact method is mainly limited by the transfer rate of the 
PCIe bus due to the limited RAM on the GPU. See Fig. [5] for a detailed discus¬ 
sion on the CUD A kernels of the parallel projection methods. The incomplete 
method works on a set of subsets which is tailored for concurrent execution. 
This increases the performance. The serial approach performs the projections 
one after another. 
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shared queue. The queue holds the groups of subsets. Before adding the groups 
to the queue, the main worker thread verifies that the group to be added does 
not overlap with groups already queued. If an overlap is detected, the thread 
waits until the blocking group is processed and removed from the queue. Only 
the main thread which adds groups of subsets to the queue is subject to long 
blocking behavior. The GPU always has a high work load as memory transfers 
and instruction execution are done concurrently on multiple streams. Listing l] 
shows the stream handling in the worker thread which coordinates the work on 
a CUDA stream. It retrieves several groups of subsets at once from the queue, 
and then adds copy and execution instructions to a CUDA stream to process 
a list of groups (a cluster) in parallel. For brevity, uninteresting parts of the 
source code like error checking have been omitted from the listing. Groups are 
designed to be processed by a thread block. This implementation launches the 
kernel in a grid of many thread blocks. The memory transfers are sufficiently 
large to make the use of multiple streams an effective latency hiding mechanism. 


Listing 1: The parallel worker threads which control the CUDA streams 


i 

3 

5 

7 


11 

13 

15 

17 

19 

21 


23 

25 

27 


29 

31 

33 

35 

37 

39 


// This function is executed by ‘stream_count ‘ threads in 
// parallel , each creates an CUDA stream and coordinates 
// the work done on that stream. It removes items from 
// the queue and processes them. End thread when terminate 
// item is recieved from the queue . 

// ‘group* is a class which contains a group of subsets 
// ‘T* can be float or double 

templateCtypename group, typename T> 
void stream_handler() { 

// CUDA device id has to be set by each thread 
cudaSetDevice(inf->device_id); 

// Create CUDA stream 
cudaStream_t mystream; 
cudaStreamCreate(femystream); 

// Remove several groups from queue on each get() call 
std::list<group*> cluster_vec; 

// Maximum number of groups we try to get from queue 
const int max_num_cluster = 128; 

// Minimum number of groups we try to get from queue, not 
guaranteed 

const int min_num_cluster = 12; 

// Allocate the maximum needed memory on host and device for 
// cluster information and q_offset 

// Array of device pointers which contains the information 
about the clusters 

SciPAL::Vector<* int , blas> cluster_info(4*max_num_cluster) ; 
SciPAL::Vector<* int , cublas> cluster_info_d(4*max_num_cluster); 

// Where to find the q values for each thread block 

SciPAL::Vector <int , blas> q_offset(max_num_cluster); 

SciPAL::Vector< int , cublas> q_offset_d(max_num_cluster) ; 

// All q values 

SciPAL::Vector<T , cublas> q_d(1024*max_num_cluster ) ; 
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step35::Kernels<T> kernels; 

// Get a list of min_num_cluster to max_num_cluster clusters 
cluster_vec = queue->get(min_num_cluster , max_num_cluster); 

// Big while loop until terminate signal is received 
while ( cluster_vec.front()->size != 0 ) { 

// Number of groups we received from queue 
int num_of_cluster = cluster_vec.size () ; 

// Fill arrays q_offset and cluster_info with the 
// device pointers to the desired information 
// ... 

// In the big q array, where can I find the q for 
II the first pixel of my cluster? 
unsigned int offset = 0; 

// Copy q for all frames to device 
for (auto it=cluster_vec.begin(), 

end=cluster_vec.end(); it!=end; ++it) { 

cudaMemcpyAsync(&(q_d[offset]), ( * it)->qmat, 

(*it)->size*sizeof (T) , 
cudaMemcpyHostToDevice, mystream); 
offset+=(*it)->size; 

} 

// Add copy command of cluster_info and q_offset 
// from host to device to CUDA stream 
// ... 


// Start a Dykstra 5 s CUDA kernel on all frames 
// the list of clusters we received 
kernels.dykstra(q_d, inf->e_d, cluster_info_d, 
num_of.cluster, inf->width, inf 
1024, femystream); 


in 

q_offset_d , 
->height, 


// Add 
// our 
// ... 


copy command of q from device to host to 
CUDA stream 


// Block until our CUDA stream has completed all operations 
cudaStreamSynchronize(mystream); 

// Signal queue that task is done 
queue->task_done(cluster_vec); 

// Get a list of min_num_cluster to max_num_cluster 
// groups from queue 

cluster_vec = queue->get(min_num_cluster, max_num_cluster); 
} // End big while loop 


// Worker thread shutdown cleanup 

// ... 


3.2 Incomplete Projection 

The exact variant has some limitations which hinder an efficient CUDA im¬ 
plementation, such as the PCIe bottleneck and the overall complexity of the 
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detection of overlapping sets. To further adapt Dykstra’s algorithm to the 
execution on a CUDA device we now restrict the eligible subsets of the image 
plane to only those which are formed by squares with edge lengths of powers 
of 2, up to Z/smre = 32. Fig. [4] shows a general overview of the discussed 
implementations, Fig. [5] shows the difference in the chosen subsets between the 
exact and incomplete implementation. 

All shifts which are a power of 2 up to 32 of these squares are considered. 
Consider the set of non-overlapping squares in the image plane with edge lengths 
2 fc , with k < iFsMRE := 5. If the origin of each square is shifted by 2 k in each 
direction the pattern is mapped onto itself. Only shifts by 2 l pixels in each 
direction with i < k pose a mapping on a different pattern. The set (l l k denotes 
all non-overlapping squares with an edge length 2 k which are shifted by 2 l pixels. 
Then construct the sets 


f V 


n «£• 


i<k<5 


The incomplete projection 


-^incomplete (^) — | 


i =0 


is computed as the projection on each SI 1 subsequently. The resulting residual is 
still in the allowed set of G in Eq. © , which preserves the statistical interpretation 
of the result. 


The projection on a set ft 1 is calculated in a single CUDA kernel (Listing 3). 


The kernel takes as input a pointer e to the residual vector on the device, the 
dimension of the vector (ni, nj) and the offset (offseti, offsetj) from where 
to generate the subsets. The shifting is carried out as shown in |Listing 2| by 
calling the kernel with different offsets i. This increases the number of considered 
subsets. The edge length 2 smin of the smallest subsets can be specified to avoid 
unnecessary duplicate computations as a result of the shifting. As in the kernel 
for the exact projection we make use of shared memory for the temporary arrays 
q, s_l, s_2. The exact projection kernel uses 12 bytes of shared memory in case 
of single-precision and 24 bytes for double-precision per thread, the incomplete 
kernel uses 32 bytes for single-precision per thread. The shared memory is limited 
to 48KB per thread-block. For a reasonable implementation of the ICD we need 
the full 1024 threads per thread block rendering computations in double-precision 
infeasible on all existing generations of the CUDA architecture. 


Listing 2: Wrapper function around the incomplete Dykstra kernel 



template < typename T> 

2 

void ICD_handler () { 


int gridsi = info->width/32; 

4 

int gridsj = info->height/32; 


dim3 grid(gridsi ,gridsj) ; 

6 

dim3 blocks(1024,1) ; 


for (int i = 0; i < 5; i++) { 

8 

__incomplete_dykstra<TX<<grid ,blocks>>>(info->e_device , 


info->width, 

10 

info->height, 


2~i, 2 ~i, i); 

12 

} 


> 
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(a) Thread block pattern used in both 
incomplete and exact method 
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(c) Thread block pattern used only in 
exact method 


(d) Thread block pattern used only in 
exact method 


Figure 5: Visualization of the differences between the exact and the incomplete 
method. The illustrations show 32 x 32 pixel sections of the larger image, each 
section is processed by one CUD A thread block of 1024 threads. The incomplete 
method makes optimal use of thread blocks by using subsets of D of edge length 
1, 2,4, 8,16 and 32 pixels which fill up a 32 x 32 pixel section. The subfigures (a) 
and m show the pattern of subsets of edge length 16 and 4 pixels respectively. 
While the exact method uses all subset patterns of the incomplete method it 
also includes subsets of D with all edge lengths and all possible shifts within the 
image. The patterns (c) and |(d)| are thus only used in the exact method and not 
in the incomplete method. 
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Listing 3: incomplete implementation of Dykstra’s algorithm 


// 

// 

// 

// 

// 

// 

// 


CUDA adapted incomplete Dykstra projection 
Oparam e pointer to the residual 
@param ni height of the image 
@param nj width of the image 

@param offseti offset in vertical direction 
@param offsetj offset in horizontal direction 
@param smin minimum subset size is 2 SItlin 


// C T C can be float or double, on current hardware only float is 
supported 

templateCtypename T> 

__global__ void 

__incomplete_dykstra(T *e, const int ni, const int nj, const int 
offseti , const int offsetj , const int smin) { 

// Allocate temporary arrays q, si, s2 in shared memory 

// ... 


// Set q = 0 

for (int s =5; s> = 0; s —) { 

q[s*1024+threadldx .x]=0; 

> 


// Temporary variables 
unsigned int is, js, idx; 

// Tolerance 

const T tol = TOLERANCE; //= le-3 


// 

Do parallel projections 

until convergence test passes 

T 

delta = 2.0*tol ; 



wh 

ile ( delta > tol= 

*1024.0 

) { 


delta = 0; 




// Wait for all 

threads 

before starting the iteration 


__syncthreads () ; 




// In one threadblock we apply dykstra’s algorithm to 
// subsets with edge lengths 32, 16, 8, 4, 2 and 1. 

// Each thread processes one pixel, 
for (int s = 5; s >= smin; s - -) { 

// Edge length of the subset = 2 s 
int SubsetLength = pow_of_two(s); 

// Number of subsets in one threadblock = 2 5-s 
int SubsetNum = pow_of_two(5 - s); 

// Number of pixels in each subset = 2 2 ' s 
int SubsetSize = pow_of_two(2*s); 

// Get Line in global image, assign to is 

// ... 

II Column in global image, assign to js 

// ... 

II For this iteration this thread is supposed 
// to process the pixel (is, js), the pixel 
// index idx in the global image is given by: 
idx = is *n j + js; 

// Fill shared memory with variables we use later 
si [threadldx .x] = e[idx] - q[s *1024+threadldx .x]; 
s2[ threadldx .x] = si [threadldx .x]*s2 [threadldx .x]; 
// Wait for all threads 
__syncthreads () ; 

// Index to first pixel of my subset 
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61 


int SubsetStartIdx 
SubsetSize ; 


(threadIdx .x/SubsetSize)* 


63 


65 


{ 


67 


71 

73 


// Sum over all pixels of one subset , write the result 

// to s2[SubsetStartIdx] 

while (m <= pow_of_two(2*s-1)) { 

if (threadldx.x - SubsetStartIdx + m < SubsetSize) 

s2 [threadldx .x] += s2[ threadldx .x + m] ; 

} 

m = m << 1; // m = m*2 

__syncthreads(); 

> 

/ / q = X r +1 — x r 

q[s*1024 + threadldx.x] = -si [threadldx .x] ; 


75 

77 

79 

81 

83 


// Eq. ( |14| ) from Dykstra 5 s algorithm 

// ‘cs ‘ is a array allocated in CUDA constant memory 
if (cs[s]*s2[SubsetStartIdx] > 1.0) { 

si [threadldx .x] /= sqrt(cs[s]*s2[SubsetStartldx]); 

> 

// Update q 

q[s* 1024+ threadldx .x] += s1[ threadldx .x] ; 

// Calculate increment , mabs is our abs function 
delta+=mabs(e[idx] - si [threadldx .x]); 


85 

87 


89 

91 } 


} 


// Update the estimate of residual 
e[idx] = si [threadldx .x]; 

// Wait for all threads before next step 
__syncthreads(); 
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4 Results 


The CUD A implementation of the new ICD variant is found to be 100 times 
faster than a serial implementation of Dykstra’s algorithm and 10 times faster 
than the CUD A implementation of the exact version. Yet, it reconstructs images 
nearly as well as the exact implementation, i.e. for the naked eye the results 
are indistinguishable. Both implementations proved robust against noise and 
blur. The runtime performance is tested on synthetic images of different sizes. 
Because of the enormous memory consumption of the exact variant of Dykstra’s 
algorithm we had to vary the SMRE depth Asmre for the different tests in 
order to keep the computations feasible. As an example of a real-life application 
we use the ICD as pre- and postprocessor for the SOFI algorithm which gives a 
further increase in resolution of up to 30%. 


4.1 Test Data 


To assess the quality of the estimator we use synthetic data generated from 
the standard Lena image with simulated noise and blur. For synthetic data 
the true signal is known and the quality of the reconstruction algorithm can be 
estimated directly. Figure |6a| shows the unperturbed standard Lena test image. 
The simulated test image (Fig. |6b| ) is generated by blurring the original image 
with a Gaussian kernel of width ctpsf = 4 px (px = pixel) and adding Gaussian 
noise with a standard deviation of a = 1. The noise strength a = 1 corresponds 
to about 0.5% of the maximum signal and is given in gray levels. We generated 
further test images with smaller signal to noise ratios, a = 3 and a = 10 (not 
shown), for comparison. 

To assess the performance in real applications we study the combination of 
SMRE with SOFI (see Sec. 2.4) on experimental data. 

The final test of the exact and incomplete implementation of Dykstra’s 
algorithm is done on optical widefield microscopy data, previously analyzed 


with the SOFI framework (Huss et al. 2013) in order to study the details of 
intracellular trafficking and assembly of GABA-B neurotransmitter receptors 
in hippocampal neurons. The data set represents a time series of the temporal 
fluctuations of the fluorescence signal of a nerve cell labeled with quantum 
dots which attach to the receptors. The fluctuations of the fluorescence stems 
from the blinking behavior of the quantum dots which is random and follows a 


power-law for the on- and off-times (Kuno et al. 2001). The movie consists of 


3091 frames. Besides the considerable amount of out-of-focus light due to the 
widefield illumination, each image shows the noise and blur typical for optical 
imaging. A sample frame from the raw data is shown in Fig. [7a] The blur occurs 
due to the diffraction limit. The microscope records the real object convolved 
with the PSF. We approximate the convolution kernel as Gaussian. In terms of 
the SMRE framework this is the measurement operator. 

The images are typically acquired in the photon-limited regime. Hence 
one has to deal with Poissonian noise disturbing the image, i.e. to apply the 
Anscombe transformation before running the SMRE. 
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(a) Test image, true signal. 


(b) Test image with simulated blur and 
Gaussian noise of strength cr — 1. 




(c) Reconstruction using exact implemen- (d) Reconstruction using incomplete im- 
tation, noise strength a = 1. plementation, noise strength a = 1. 



(e) Reconstruction using incomplete im- (f) Reconstruction using incomplete im¬ 
plementation, noise strength a = 3. plementation, noise strength a = 10. 


Figure 6: Test of the algorithms on the Lena image, for details refer to Secs. 4.1 


and 4.3 
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(a) original frame (b) SMRE reconstruction 


Figure 7: Sample frame from the experimental time series of the hippocampal 


neurons (Huss et al. 2013) (a), and its SMRE reconstruction (b) 


4.2 Performance 


Figure 8a compares the runtime of 100 ADMM iterations (in each iteration 


step Dykstra’s projection algorithm is run until convergence) between a serial 
implementation on a CPU (Intel Xeon X5675) and the exact and incomplete 
implementation on a NVIDIA Tesla K20c GPU. For all variants we used a 
fixed tolerance of 10 -3 for the Dykstra algorithm. For the exact GPU and the 
serial CPU implementation the runtime depends on the number of subsets. The 
number of subsets for the incomplete GPU implementation only depends on the 
edge length of the image L/. The plots do not include the startup time required 
for the initialization of the algorithms. While the main factor of the startup time 
for the incomplete implementation is the time needed to load the image, the 
startup time for the exact and CPU variants depends on the number of subsets. 
The latter algorithms need to preallocate memory for each q s array of every 
subset s. For the plots the maximum multi-resolution scale is Tsmre = 15 to 


make the computations feasible. As explained in Sec. 3.1 the exact method then 
requires 14 GB of memory when run in single precision. The runtimes of the 
different methods are considerably different. The incomplete method achieves 
an overall speedup of up to 100 over the single-threaded CPU implementation 
and is 10 times faster than the exact method. 


4.3 Estimator quality 

Based on the standard Lena test image, cf. Sec. |4.1[ we analyze the quality 
of the statistical multi-resolution estimator. Figure [6] summarizes the results 


from both implementations for am 1 (Figs. 6c and 6d) and the ICD results 
for a = 3 (Fig. |6e| and a = 10 (Fig. [6f]). All reconstructions use a maximum 
multi-resolution scale of Tsmre = 32. The reconstructed image obtained from 
the exact implementation of Dykstra’s algorithm (Fig. |6c| is visually almost 
indistinguishable from the result obtained from the ICD variant (Fig. 6d), which 
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Figure 8: Runtime (a) and speedup (b) for 100 ADMM iterations. The exact 
GPU variants and the CPU variants use a maximum multi-resolution scale 
Lsmre = 15. All runtimes given exclude the startup time. 


is quantitatively confirmed by the FRC, cf. Fig. |9a| 

The FRC as described in Sec. |2.5| provides a quantitative measure of the 
resolution. Figure 9a shows the FRC of the Lena images displayed in Fig. [6j 
Visual inspection and the FRC indicate that the SMRE methods improve the 
image resolution and that the results of the exact and incomplete implementation 
are nearly identical. 

An important aspect of an estimator is robustness against noise strength and 
PSF radius. Typically estimators tend to decrease in quality with increasing 
noise strength and PSF radius. This behavior is investigated numerically in 
Fig. |9b| and Fig. [9cJ where the robustness against noise and PSF radius is 
plotted as L 2 -error \\x — x*|| 2 of the reconstruction x with respect to the true 
signal x*, normalized by the error \\I — x*|| 2 of the simulated/measured image /. 
Values below 1 indicate an improvement in quality of the reconstruction over 
the measurement. Instabilities in the estimator become visible by observing the 
proposed ratio. Figure |9b| shows the robustness against noise strength for fixed 
PSF radius ctpsf = 4px. The estimators are robust even for large noise. For 
small noise is the incomplete estimator slightly less efficient. The plot indicates 
that the quality of the exact estimator is better in terms of the L 2 -distance to 
the true signal compared to the incomplete estimator. The L 2 -distance plot for 
the robustness against the PSF radius with a fixed noise level of a = 3 (Fig. [9c]) , 
shows that the estimators are very robust against the PSF radius. Here the 
exact and incomplete estimators are nearly identical in terms of the L 2 -distance 
to the true signal. The image quality improvement is about 20% for both images 
in terms of the L 2 -norm. 


4.4 SOFI results 

Figure [10| shows a comparison of using SMRE as a pre- and postprocessor for 
SOFI and the SOFI image without SMRE on the experimental dataset described 
in Sec. |4.1| Because of the much better performance all results were calculated 


using the ICD variant of Dykstra’s projection algorithm. In fact, it is the only 
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(a) FRC for simulated data and corresponding reconstruction with ctpsf = 4px, a = 1. 
Note that both variants of Dykstra’s algorithm produce nearly identical results. 


^ 0.85 


0.8 


1 1 


■ 

Incomplete Dykstra 
Exact Dykstra 

















_ 







Ml- Incomplete Dykstra 
—-—Exact Dykstra 



























10 20 30 40 

Standard Deviation a 


5 10 15 

PSF width <j ps f [px] 


(b) Robustness against noise level with ( c ) Robustness against PSF radius with 
fixed PSF radius of ctpsf = 4px. fixed noise level a = 3. 


Figure 9: (a) Fourier ring correlation for the images from Fig. [6] Robustness 
against noise strength a [E] and PSF radius ctpsf m based on the L 2 -error 
\\x — a ;*||2 of the reconstruction x with respect to the true signal x* normalized 
by the L 2 -error ||I — x*|| 2 of the simulated image / with varying a or 

(jpsF d(c)D for the Lena test image. The exact implementation uses a maximum 


multi-resolution depth Lsmre = 32 in (a) and Lsmre = 15 in (b) and (c) 
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Figure 10: (a) SOFI second order with SMRE as preprocessor on each raw image, 
(b) SMRE applied as postprocessor on the final SOFI second order image, (c) 
SOFI second order. The labels of the enlarged insets correspond to the region 
and reconstruction method respectively. As SMRE method we use ICD. 
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(a) Mean FRC: The resolution is 129 nm for the SOFI result, 
95 nm for SOFI with prior SMRE, and 127 nm for SMRE 
with prior SOFI. 

power spectral density 



10 4 10 3 10 2 
[nm] 


(b) Normalized power spectral densities 


Figure 11: Fourier ring correlation and power spectral density for the images 


in Fig. 10 The resolution in the result with prior (SOFl 2 (SMRE)) or posterior 


deconvolution (SMRE(SOFl2)) is better compared to the resolution without 


SMRE. 
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variant for which these computations are feasible. Due to the structure of the 
expected signal we use the generic L 2 norm as regularization. 

In the following we denote by SOFI 2 (ICD) the usage of the ICD variant of the 
SMRE as a preprocessor on each frame and by ICD(SOFI 2 ) the usage of SMRE 
as a postprocessor on the final SOFI image of second order. If we rather want 
to stress that we use SMRE at all, we write SOFI 2 (SMRE) or SMRE(SOFI 2 ), 
respectively. 

The comparison of SOFI 2 (ICD) with ICD(SOFI 2 ) and SOFI without pre- 
or postprocessor in Fig. [To| shows a resolution improvement by using SMRE as 
a preprocessor for SOFI 2 . This leads to a greatly increased numerical effort, 
compared to SOFI 2 and ICD(SOFI 2 ). For ICD(SOFI 2 ) the workload shrinks 
from executing SMRE on several thousand frames to applying the SMRE concept 
to only one, i.e. the final, image. The FRC and the power spectral density (PSD) 
shown in Fig. 11 suggest a resolution improvement of about 35% for SOFI 2 (ICD). 
The resolution is estimated as intersection of the correlation curve with the 
2cr threshold. The resolution is 129 nm for the SOFI 2 result, 95 nm for SOFI 2 
with prior SMRE, and 127 nm for SMRE with prior SOFI 2 . The FRC and PSD 
curves in Fig. EH are averages over several realizations of the reconstructions. 
The SOFI images were calculated from 1000 frames randomly chosen from the 
full data set. Both methods indicate that the resolution in the result with prior 
SMRE is the highest. 


5 Discussion 


We evaluated the performance of two different CUDA-implementations of Dyk¬ 
stra’s algorithm. The exact variant implements Dykstra’s algorithm as defined 
in the mathematical literature with adaptions to the CUDA architecture while 
the incomplete variant only computes an approximation to Dykstra’s projection 
by restricting the possible subsets. In numerical comparisons this approximation 
seems to introduce only a small error. The result is nearly as good as the exact 
calculation while being much faster. 

For the incomplete variant a rigorous proof of the convergence of the ADMM 
is still missing, but the numerical evidence for the convergence is encouraging. 

The exact implementation has significant disadvantages in terms of efficiency 
compared to the incomplete implementation. While the algorithm generally 
achieves good efficiency and populates all the multiprocessors, it is limited by 
the bandwidth of the PCIe bus, as data has to be constantly transferred between 
device and host. The memory transfers are necessary, because with increasing 
image size Lj and multi-resolution depth i^sMRE the total amount of memory 
required for storing all q s variables easily exceeds the global memory available 
on the device. In that sense, the exact method cannot be implemented in a 
GPU-only fashion. 

The usability of our SMRE approach in cooperation with super-resolution 
optical fluctuation imaging was shown with different datasets. Quite similar 
SMRE methods were deployed on images obtained from STED microscopy (Hell 
2007 Frick, Marnitz, and Munk 2013). 


However, using SMRE for large-scale, high-throughput applications, e.g. as 
preprocessor in the SOFI method, is only feasible with the incomplete variant 
of Dykstra’s algorithm which we have introduced in this paper. The resolution 
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improvements are in line with those obtained from Fourier-reweighted SOFI 
images where the deconvolution is done in frequency space. 

The lesson to be learned from the success of the ICD is that the most crucial 
feature is the way the set of possible subsets is sampled. A similar problem 
arises in Monte Carlo simulations of e.g. protein folding or multi-dimensional 
spin systems. Typically the number of possible states of the system grows (for 
practical purposes) exponentially with the system size and thus is impossible 
to be completely enumerated. Instead Monte Carlo methods rely on a clever 
sampling of the state space by using the appropriate probability distribution, 
which depends on the chosen algorithm. For Metropolis-Hastings (Metropolis 
et al. 1953) it is the Boltzmann distribution and for the Wang-Landau (Wang 
and Landau 2001) or umbrella sampling it is the density of states, although it 
has to be constructed iteratively during the course of the simulation. With that 
in mind the positive results of the incomplete variant of Dykstra’s algorithm is 
much less surprising. Given the fact the we only used the most obvious choice 
of a CUDA-friendly restriction of the shapes of the subsets we expect further 
speedups just by revisiting the way we have partitioned an image. 

There are certainly other aspects we still have to address in order to get a 
complete picture of the advantages and disadvantages of our statistical multi¬ 
resolution estimator and which we either had to omit or could touch only briefly 
in this paper. From a technical point of view one still has to investigate the 
influence of the choice and maximum size of the subsets and the relationship of 
the termination criteria of the ADMM and Dykstra iterations to the quality of 
the reconstructed image. The SOFI method is only one way of super-resoluton 
microscopy and is mostly applied to widefield images. It is certainly of interest 
whether SMRE can be combined with other microscopy techniques, especially 
the parallel array microscope (PAM) (Heintzmann et al. 2001 De Beule et al. 


2011), which provides high-speed, high-throughput confocal imaging. The SMRE 
algorithm is not restricted to 2D images. Therefore, a very interesting extension 
of the method would be to use 3D subsets for true 3D reconstructions. Beyond 
all of these technical issues we would also like to test the performance of SMRE 
on other physical problems. For instance, deconvolution is an important topic in 
the context of fluorescence lifetime microscopy (Verveer, Squire, and Bastiaens 
2001) which is highly important for imaging “single molecules in action”. The 


fluorescence lifetime crucially depends on the details of the chemical environment 
of a molecule and thus is able to track subtle interactions, for instance among 
proteins. 
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A CUDA arch 


A.l CUDA programming model 

For the convenience of the reader we first briefly recall the CUDA-specific terms 
and the programming model. CUDA and OpenCL allow to harness the compute 
power of modern GPUs for entirely non-graphical purposes. The concepts are 
illustrated in Fig. [3j GPUs in their current design offer thousands (NVIDIA 
Cooperation 2012]) of independent compute cores, each executing a thread and 
each with its own dedicated amount of memory, the registers. To distinguish the 
CPU, running the application, and the GPU executing data-parallel, compute¬ 
intensive tasks the notion of host (CPU) and device (GPU) was introduced. 
Depending on the exact architecture a varying number of cores on the GPU 
forms a streaming multiprocessor (SM). The SM manages memory requests 
and issues instructions for threads. The threads are addressed in groups of 32 
threads, which is called a warp. In principle all threads of a warp execute the 
same instruction. Multiple warps form a thread block. A thread block resides 
on a single SM. Threads of the same thread block can exchange data via the 
shared memory of the SM. In order to process all of the input data, e.g. all 
pixels of an image, many thread blocks have to be started forming a so called 
grid. Data exchange between different thread blocks in a grid has to be done via 
the global device memory i.e. the VRAM. This communication inflicts already 
some latency which is hidden by a high level concurrency (’’high occupancy”) 
due to starting thousands of or more threads. The bottleneck of many CUDA 
applications is the PCIe connection between host and device since it limits the 
memory transfer. This becomes an even bigger problem if the compute node 
has more than one GPU. 

CUDA can be considered as a proprietary extension of the C programming 
language to GPGPU computing by adding the necessary keywords to handle 
data-parallel multithreading. The API offers a way to call a kernel with specified 
thread block and grid dimensions which is executed on the GPU. Basically, 
a kernel is an ordinary function identified by the qualifier __giobai__. In the 
simplest case each instance of a kernel processes an independent data element 
per computation. The data element corresponding to a given thread, which has 
been started upon kernel launch, can be calculated from the thread’s position in 
the grid of threads by means of the built-in variables Threadidx, BiockDim, Biockidx 
and GridDim. If threads have to frequently access the same data elements or 
exchange data with other threads, the shared memory of the SM can be used 
as a manually managed cache. Access to the shared memory is coalesced if all 
threads of a warp target the same cache line. Otherwise the access is serialized in 
as many accesses as there are cache lines addressed. Shared memory is allocated 
by prepending the __shared__ qualifier to a variable in the scope of the kernel. 
Synchronization among threads of a thread block is done by the __syncthreads() 
method. This is especially necessary for data exchange via the shared memory. 

The kernel call itself runs asynchronous to the host part of the program, this 
means the next instruction after the kernel launch is immediately executed. If 
this is e.g. a memory transfer of the results of the kernel just started, it will 
probably not yield the corrected results. Thus the host has to wait until the 
kernel finishes, this is achieved by using the cudaDeviceSynchronizeO directive. 

In some situations, as we will show in the following sections, we can make 
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use of this asynchronous behavior to achieve an optimal occupancy of the 


device. With the advent of the Kepler architecture (NVIDIA Cooperation 2012) 
Nvidia introduced an improved mechanism, called Hyper-Q, to handle concurrent 
execution of kernels and memory transfer via the use of streams. A CUDA stream 
is a sequence of operations that execute in issue-order on the GPU. Operations 
in different streams may run concurrently, before Hyper-Q the concurrency was 
limited as the CUDA streams were multiplexed into a single hardware work 
queue. The Kepler architecture provides 32 work queues with no inter-stream 


dependencies. For a recent example of using Hyper-Q see (Tran et al. 2014). 
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# Listing of Parameters 


subsection input data 

# path to the .tif image 
set image 

# enforces first constraint 

set alphal = 1.2 

# enforces second constraint 

set alpha2 = 0.12 

# Estimate of gaussian noise standard deviation. 

# If simulate , gaussian noise will be added to the image . 
set gaussian noise = 0 

# intensity of regularisation 
set regularization = 1.0 

# stabilises first constraint 

set rho1 = 6.192 

# stabilises second constraint 

set rho2 = 1.8 

# PSF spread 

set sigma = 3 

end 


subsection output 

# where should we put the output image? Will be a tiff image 
set output image = control.tif 

# save preliminary results of the output image 

set control = false 

end 


subsection program flow control 

# largest patch edge length if not using small dykstra in 
approximation 

set MRdepth = 15 

# do a small dykstra in approximation 

set approx = false 

# Set maximum number of iterations 
set maximum iterations = 10000 

# Reporting progress in intervals of ... Iterations 

set report interval = 1 

# Finish when Ix_r - x_{r-l}| < tolerance 

set tolerance = le-3 

end 
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50 

52 


subsection simulate dataset from real image 

# If set to false the input is treated as real 

# if true input will be treated as test image 

# blurring and noise are added 
set simulate = true 

# If false simulated noise has a constant seed 

# if true the seed is taken from the clock 
set time seed = false 
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end 


data 

where 
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